ABSTRACT

It has now been twenty years since the publication of the original paper, "An
Optimum Character Recognition System Using Decision Function" by C. K. Chow in IRE
Trans. on Electronic Computers in 1957, in which he formulated pattern recognition as
a problem of statistical decision theory. During the last two decades, statistical
pattern recognition was well developed in theory and applications, with peak
activity in the late sixties. The area has now reached a fairly saturated condition
as its capability and limitations are well explored. The limitations are obvious:
the patterns are not characterized by statistical information alone, and many
useful statistical properties cannot be fully developed with available mathematical
statistics. The paper outlines important but unsolved problem areas in statistical
pattern recognition and then takes a new and close look at some problems which are
related to the finite sample size constraint. In an effort to bridge the gap between
theory and practice, constructive solutions are provided for the problems: finite
sample distance and information measures, finite sample nearest neighbor decision
rule, contextual analysis, decision rules based on discrete and continuous
measurements, and the finite sample stochastic syntax analysis. It is concluded that
there are still many challenging problems to be solved in statistical pattern
recognition and every effort should be made so that the theory works well in practice.

A New Look at the Statistical Pattern Recognition

I. Introduction

It has now been twenty years since the publication of the original paper, "An
Optimum Character Recognition System Using Decision Function" by C. K. Chow in IRE
Trans. on Electronic Computers in 1957, in which he first formulated pattern
recognition as a problem of statistical decision theory. During the last two decades
statistical pattern recognition was well developed in theory and applications, with
peak activity in the late sixties. The area has now reached a fairly saturated condition
as its capability and limitations are well explored. The limitations are obvious:
the patterns are not characterized by statistical information alone, and even some
important statistical properties cannot be developed with available mathematical
statistics. This situation has been quite typical of the application of every
branch of mathematics. However, for researchers new or old to this area there are
still many challenging problems remaining to be solved. In [1], ten problem areas
where solutions are wanted are listed, not necessarily in the order of
importance, as: feature extraction, nonstationary patterns, adaptive systems,
learning complexity, finite sample size effects, computational recognition
complexity, contextual analysis, optimum pattern recognizer, statistical and
syntactic mixed model, and the automatic generation of recognition rules for
complex patterns. It is hoped that good solutions to some of these problems will
become available in the next decade.

It is not intended in this paper to survey the area, which is now quite broad.
Many books and articles have done this survey well. Instead the paper takes a new
and close look at certain problems in statistical pattern recognition and offers
some constructive solutions. Particular attention is given to bridging the wide
gap between theory and practice, notably the problems of finite sample constraint.

II. Finite Sample Distance Measures

Distance measures are useful for feature selection and extraction and for
error bounds of Bayes error probability (see e.g. [2], Chapter 4). They have been
extensively examined in recent years under the assumption of large sample
size. In practice the sample size may be limited or small and many conclusions
drawn under the infinite sample assumption may not be valid under the finite sample
constraint [3]. The discussion here will be limited to Gaussian measurements for
divergence and Bhattacharyya distance but can be easily extended to other cases.

Consider first the case of two univariate Gaussian densities with means $m_1$ and $m_2$

and variance $\sigma^2$. Let "^" denote the quantity evaluated by using sample estimates.
Then the difference in divergence between the finite sample and infinite sample cases

is always positive, where $N_1$ and $N_2$ denote the numbers of samples for classes 1 and
2 respectively. It is also assumed that all samples are statistically independent.
The positive bias given by Eq. (2) indicates that the divergence evaluated by
using a finite number of samples can lead to an over-optimistic estimate of the
error probability.
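
The positive bias can be checked numerically. The following is a minimal Monte Carlo sketch, not taken from the paper. It assumes that the divergence of two univariate Gaussian classes with a common variance is $J = (m_1 - m_2)^2/\sigma^2$ and that the estimate plugs in the sample means and the pooled unbiased sample variance. The function name, the class parameters, the sample sizes and the seed are all hypothetical.

```python
import numpy as np

def mean_estimated_divergence(m1, m2, sigma, n1, n2, trials=20000, seed=0):
    """Average plug-in divergence (mean1 - mean2)^2 / pooled_var over many sample sets."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(m1, sigma, size=(trials, n1))  # one row per simulated sample set
    x2 = rng.normal(m2, sigma, size=(trials, n2))
    # Pooled unbiased variance estimate (assumed estimator, see lead-in).
    pooled_var = ((n1 - 1) * x1.var(axis=1, ddof=1)
                  + (n2 - 1) * x2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    j_hat = (x1.mean(axis=1) - x2.mean(axis=1)) ** 2 / pooled_var
    return j_hat.mean()

true_j = (0.0 - 1.0) ** 2 / 1.0 ** 2            # divergence with the true parameters
small = mean_estimated_divergence(0.0, 1.0, 1.0, n1=10, n2=10)
large = mean_estimated_divergence(0.0, 1.0, 1.0, n1=200, n2=200)
print("true J:", true_j)
print("mean estimate, 10 samples per class: ", small)
print("mean estimate, 200 samples per class:", large)
```

Under these assumptions the average estimate exceeds the true value for the small sample sets and moves toward it as the sample sizes grow.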


Next consider the univariate Gaussian densities with zero means and variances

The ratio $u$ of the two variance estimates has the F-distribution with $(N_1, N_2)$
degrees of freedom. The

By using the Taylor series expansion of $\hat{B}$ with respect to the true value $B$, and
retaining terms up to the second order in the expression, we obtain

which is negative for a variance ratio greater than $1 + \sqrt{2}$ and positive otherwise.
As the sample sizes approach

However, the


Now consider the multivariate Gaussian densities for p-dimensional measurements.
Let $\bar{x}_1$ and $\bar{x}_2$ be the sample mean vectors corresponding to the true mean
vectors $\mu_1$ and $\mu_2$ of classes 1 and 2 respectively. Also let $S$ be the sample
estimate of the common covariance matrix $\Sigma$ given by

$$S = \frac{1}{N_1 + N_2 - 2}\left[\sum_{i=1}^{N_1}(x_i - \bar{x}_1)(x_i - \bar{x}_1)' + \sum_{i=N_1+1}^{N_1+N_2}(x_i - \bar{x}_2)(x_i - \bar{x}_2)'\right] \qquad (7)$$

For the infinite sample case the divergence is

$$J = (\mu_1 - \mu_2)'\,\Sigma^{-1}(\mu_1 - \mu_2)$$

which is the same as the Mahalanobis distance.


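For the finite sample case the divergence is estimated by replacing the true mean
vectors and $\Sigma$ in $J$ with the sample mean vectors and $S$, where $E[S] = \Sigma$.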


Let $k = \frac{1}{N_1} + \frac{1}{N_2}$. The random variable $\hat{J}/k$ has Hotelling's
$T^2$ non-null distribution (see e.g. [4]) with $N_1 + N_2$ samples and $p$ degrees of
freedom given by


(p/2)+r-l 

2T0f+n+? 


The expectation of $\hat{J}$ can be written as


1 J f - p + i T f - p + i Hi 

For $p = 1$, $f = 1$ and Eq. (11) is the same as Eq. (2). For $p > 1$, $E(\hat{J})$ differs
from $J$ not only by an additive term depending on sample size but also by a
multiplicative constant independent of the sample sizes. The undesired effect of
finite sample size is quite evident from Eq. (11). For the unequal mean but equal
covariance case the Bhattacharyya distance experiences the same effect as
divergence since $B = J/8$.


For two multivariate Gaussian densities with zero mean vectors and covariance
matrices $\Sigma_1$ and $\Sigma_2$ whose unbiased estimates are $V_1$ and $V_2$ respectively, the
divergence based on sample estimated parameters is

$$\hat{J} = \tfrac{1}{2}\,\mathrm{tr}\{V_1 V_2^{-1} + V_2 V_1^{-1}\} - p$$

Since the measurements from the two classes are independent,

$$E(\hat{J}) = \tfrac{1}{2}\,\mathrm{tr}\{E(V_1)E(V_2^{-1}) + E(V_2)E(V_1^{-1})\} - p$$

The estimates follow the Wishart distributions with expectations

where the bias term coincides with Eq. (4) for $p = 1$, i.e. the univariate case.


The above discussion clearly illustrates the effect of finite sample on the

determined. In general the estimated divergence has positive bias while the
behavior of Bhattacharyya distance is less predictable.

III. Finite Sample Information Measures

For feature selection, more informative features result in low classification
errors. However, if the sample size is limited, information measures estimated
from samples may not be as effective. Consider the equivocation for m classes
defined as

$$H = -E\left[\sum_{i=1}^{m} P(\omega_i/X)\log P(\omega_i/X)\right] \qquad (16)$$

where $P(\omega_i/X) = P_i$ is the a posteriori probability of the ith class and the
expectation is taken with respect to the space of X.

The sample-based equivocation using the estimated a posteriori probability
$\hat{P}_i$ is

$$\hat{H} = -E\left[\sum_{i=1}^{m} \hat{P}_i\log\hat{P}_i\right] \qquad (17)$$


Let $\theta_i$ be the parameter of the ith class, and $\hat{\theta}_i$ its estimate. Assume that the
effect due to sample size is small so that we need consider only the first two
terms in the Taylor series expansion of $\hat{P}_i$,

$$\hat{P}_i = P_i + P_i'(\hat{\theta}_i - \theta_i) \qquad (18)$$

where $P_i'$ is the partial derivative of $P_i$ with respect to $\theta_i$ evaluated at $\hat{\theta}_i = \theta_i$.


The difference between the estimated and true equivocations, expanded to the second
order in $\hat{\theta}_i - \theta_i$, can be written as

$$\hat{H} - H = -E\sum_{i=1}^{m}\left[P_i'(\hat{\theta}_i - \theta_i)(1 + \log P_i) + \frac{P_i'^{\,2}(\hat{\theta}_i - \theta_i)^2}{2P_i}\right] \qquad (19)$$

which is still a function of $\hat{\theta}_i$. If $\hat{\theta}_i$ is an unbiased estimate of $\theta_i$, then the
expectation of the difference with respect to the estimated parameter depends only
on the variance of $\hat{\theta}_i$, which is usually inversely proportional to the sample size.
The variance of $\hat{H}$ given by $E(\hat{H} - E(\hat{H}))^2$, where the expectations are with respect
to $\hat{\theta}_i$, can be shown to be proportional also to the variance of $\hat{\theta}_i$, or inversely
proportional to the sample size. For Renyi's information, it has been shown
[5] asymptotically that the sample estimate of the information can have a variance
of the order of the reciprocal of the squared sample size under certain conditions.
Thus the equivocation estimated under finite sample can be quite inaccurate.
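
The dependence of the estimated equivocation on the sample size can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the paper: two equally likely classes, univariate Gaussian class densities with unit variance and true means -1 and +1, the class mean as the only estimated parameter (its sample mean over N draws), and the expectation over X taken under the plug-in mixture density and evaluated on a grid. All names are hypothetical.

```python
import numpy as np

GRID = np.linspace(-10.0, 10.0, 4001)   # integration grid over the measurement X
DX = GRID[1] - GRID[0]

def equivocation(theta1, theta2):
    """Equivocation (natural log) of two equally likely unit-variance Gaussian classes."""
    f1 = np.exp(-0.5 * (GRID - theta1) ** 2) / np.sqrt(2.0 * np.pi)
    f2 = np.exp(-0.5 * (GRID - theta2) ** 2) / np.sqrt(2.0 * np.pi)
    mix = 0.5 * (f1 + f2)
    # A posteriori probabilities, clipped so that log() never sees an exact zero.
    post1 = np.clip(0.5 * f1 / mix, 1e-300, 1.0)
    post2 = np.clip(0.5 * f2 / mix, 1e-300, 1.0)
    integrand = -(post1 * np.log(post1) + post2 * np.log(post2)) * mix
    return integrand.sum() * DX          # simple Riemann sum

def equivocation_variance(n, trials=2000, seed=0):
    """Variance of the plug-in equivocation when each mean is estimated from n samples."""
    rng = np.random.default_rng(seed)
    t1 = rng.normal(-1.0, 1.0 / np.sqrt(n), trials)   # distribution of a sample mean
    t2 = rng.normal(+1.0, 1.0 / np.sqrt(n), trials)
    values = np.array([equivocation(a, b) for a, b in zip(t1, t2)])
    return values.var()

var_10 = equivocation_variance(10)
var_100 = equivocation_variance(100)
print("variance of estimated equivocation, N = 10: ", var_10)
print("variance of estimated equivocation, N = 100:", var_100)
```

In this setting the variance of the estimate shrinks as N grows, in line with the inverse dependence on sample size discussed above.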

IV. Finite Sample Nearest-Neighbor Decision Rule

The nearest-neighbor decision rule (NNDR) is attractive in the sense that the
NN-risk is upper bounded by twice the Bayes risk when the sample size approaches
infinity. For a given sample size the 1-NNDR is uniformly better than the $k_n$-NNDR.
The small sample NNDR performance has been considered for the restrictive cases:
Fix and Hodges [6] investigated the small sample performance of the $k_n$-NNDR for
univariate and bivariate normal distributions; Levine et al. [7] showed that the
performance for small sample sets from uniform distributions is close to its
asymptotic value. For multivariate Gaussian densities and allowing the sample size
to increase with k, it is shown [8] numerically that the k-NNDR performs very close
to the Bayes linear discriminant analysis. This result is significant
in the sense that under the finite sample condition the NNDR is comparable to the Bayes
rule using estimated parameters. For a given set of n samples with known
classification, it would be more meaningful to compare different decision rules
using the n samples rather than to compare with the Bayes rule under the infinite sample
size assumption. For the Gaussian assumption the NNDR is very competitive with
other decision rules based on the results of [8].

In practical use of the nearest-neighbor rule, a large number of samples would
require a large amount of storage and computation. Methods for reducing the
computation requirement and editing the samples have been considered. It has been
established experimentally that there is always a small subset of good learning
samples that dominate the performance. In other words the performance would be
insensitive to sample size for good quality neighbors. This idea is somewhat
similar to the edited NNDR which attempts to eliminate samples on the wrong side
of the decision boundary.

The fundamental question of whether the Euclidean distance is most effective
in NNDR has not been resolved. Experimental results based on some weighted
Euclidean distance have indicated better recognition performance than with the
Euclidean distance. If the samples are close to being normally distributed, a better
distance computation makes use of the covariance matrix for each class, i.e. for
each neighbor belonging to the ith class, compute the squared distance

$$(x - p^{(i)})'\,V_i^{-1}(x - p^{(i)}) = d_i^2 \qquad (20)$$

and choose the class which provides the minimum $d_i^2$. The Euclidean distance may
be considered as a special case of Eq. (20) by setting $V_i = I$. If the nearest
neighbor is considered as the reference point of a class then for the minimum
distance classifier provided by the Euclidean distance NNDR, there exists a linear
decision function, obtained by using a common covariance matrix, which is at least
as good according to the classification theory (Chapter 2 of [2]). If the
covariance matrices are quite unequal then the quadratic classifier provided by
Eq. (20) is better than the linear classifier. Exactly how much better is a
question better answered experimentally. Recent investigation with seismic
data [9] has shown that the modified distance given by Eq. (20) provides more than
15% improvement in the percentage of correct recognition over the Euclidean distance
NNDR. Of course the sample size must be large enough to calculate the covariance
matrices accurately.
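
The rule of Eq. (20) can be sketched in a few lines. This is an illustration under stated assumptions rather than the procedure used in [9]: each stored sample of class i plays the role of the reference point $p^{(i)}$, the class covariance matrix is taken as the unbiased sample covariance of that class, and the function name and the toy data are hypothetical.

```python
import numpy as np

def class_covariance_nn(x, samples, labels):
    """Return the class whose stored samples give the smallest d_i^2 of Eq. (20)."""
    x = np.asarray(x, dtype=float)
    best_label, best_d2 = None, np.inf
    for c in np.unique(labels):
        pts = samples[labels == c]
        # Inverse of the sample covariance matrix of class c (assumed estimator).
        v_inv = np.linalg.inv(np.atleast_2d(np.cov(pts, rowvar=False)))
        diff = pts - x
        # Quadratic form diff' V^-1 diff for every stored sample of the class.
        d2 = np.einsum("ij,jk,ik->i", diff, v_inv, diff)
        if d2.min() < best_d2:
            best_label, best_d2 = int(c), float(d2.min())
    return best_label, best_d2

# Toy data: class 0 is spread along the first axis, class 1 along the second.
samples = np.array([[-2, 0.1], [-1, -0.1], [0, 0.1], [1, -0.1], [2, 0.1],
                    [4, -2], [4.1, -1], [3.9, 0], [4.1, 1], [3.9, 2]], dtype=float)
labels = np.array([0] * 5 + [1] * 5)
query = np.array([3.2, 0.05])

euclid_label = int(labels[np.argmin(((samples - query) ** 2).sum(axis=1))])
eq20_label, eq20_d2 = class_covariance_nn(query, samples, labels)
print("Euclidean nearest neighbor class:", euclid_label)
print("Eq. (20) class and squared distance:", eq20_label, eq20_d2)
```

On this toy set the two rules disagree: the query lies along the direction in which class 0 is spread, so the class covariance distance favors class 0 although the Euclidean nearest neighbor belongs to class 1.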

In summary, the NNDR is an effective and reliable decision rule under the finite
sample size condition. Especially for small sample sizes, when a good estimate
of the parametric density is not available, the NNDR should be used.

V. Contextual Analysis

A major weakness of statistical pattern recognition is the difficulty of taking
the contextual relations into account in the recognition process. Character
recognition is not considered here as it requires somewhat different contextual
analysis [10]. An imagery pattern is rich in contextual information, part of which
is statistical in nature. A formal statistical approach to this problem is the
compound decision theory. The finite sample constraint in digital imagery
patterns is caused by the limited number of images and the limitation in spatial
resolution. In image interpretation and classification, an image is usually
partitioned into a number of subimages. A vector measurement may be taken from
each subimage. By assuming dependence on the nearest four neighbor subimages,
the compound decision rule is to choose the class which maximizes ($\omega_k = 1, 2, \ldots, m$)

$$p(X_k/\omega_k)\prod_{j=1}^{4}\sum_{\omega_j} p(X_j/\omega_j)\,P(\omega_j/\omega_k) \qquad (21)$$

which is adapted from the last equation on page 201 of [2]. Here $X_k$ is the vector
measurement for the kth subimage. Notice that the part of the expression outside
of the product sign is identical to that used in a simple maximum likelihood
decision rule without considering neighbor subimages at all. The product term
represents the contextual information for the kth subimage. Each multiplier in
the product term represents the contextual contribution from an adjacent neighbor
subimage. By rewriting the multiplier as


$$\sum_{\omega_j} p(X_j/\omega_j)\,P(\omega_j)\,\frac{P(\omega_j/\omega_k)}{P(\omega_j)} \qquad (22)$$

it is seen that computationally this is a weighted histogram of the subimage j,
with each class being weighted by the factor $P(\omega_j/\omega_k)/P(\omega_j)$ which reflects
the dependence between two states of nature for two adjacent subimages k and j.
The accuracy of the weighted histogram is related to the performance of the compound
decision rule given by Eq. (21). The sampling distribution of the weighted
histogram for q quantization levels and a total of n pixels for the jth subimage
is

3 


r<rij  + i>— + l)  JjV 
(23)

where $r_{ij}$ is the number of pixels belonging to the ith quantization level. The
Bayes estimate of $P_{ij}$, the fractional number of pixels for the ith level, is

$$\hat{P}_{ij} = \frac{r_{ij} + 1}{n + q} \qquad (24)$$
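
As a small numerical illustration of such a Bayes estimate, the sketch below assumes that Eq. (24) takes the form (count + 1)/(n + q), i.e. the posterior mean under a uniform prior over the q level probabilities. The function name and the pixel counts are hypothetical.

```python
import numpy as np

def bayes_level_estimate(counts):
    """Return (r_i + 1) / (n + q) for each of the q quantization levels."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()     # total number of pixels in the subimage
    q = counts.size      # number of quantization levels
    return (counts + 1.0) / (n + q)

counts = [6, 0, 3, 1]    # hypothetical pixel counts for q = 4 levels, n = 10 pixels
estimate = bayes_level_estimate(counts)
print(estimate)
```

Unlike the raw frequency ratio, this estimate never assigns zero probability to a level that happens to be empty in a small subimage, and the estimated fractions still sum to one.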


By using Eqs. (21)-(24) and following the analysis of [11], an average probability
of correct recognition for the subimage k can be determined as a function of
sample size (i.e. the number of pixels n) and the number of quantization levels q.


If we consider two classes only such as object ($\omega_k = 1$) and background ($\omega_k = 2$),
then the effect of contextual dependence appears as a multiplicative factor in the
likelihood ratio. A suboptimal but much simpler scheme to determine such a
factor is to compute

$$\prod_{j=1}^{4}\frac{P(\omega_j = 1/\omega_k = 1)/P(\omega_j = 1) + P(\omega_j = 2/\omega_k = 1)/P(\omega_j = 2)}{P(\omega_j = 1/\omega_k = 2)/P(\omega_j = 1) + P(\omega_j = 2/\omega_k = 2)/P(\omega_j = 2)} \qquad (25)$$

as the histograms computed for all four neighbor subimages tend to cancel out in
the numerator and denominator. Other simple ways to profitably utilize the
statistical contextual information in image analysis should also be examined both
theoretically and experimentally.
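
The compound decision rule of this section can be sketched as follows. The sketch assumes that the rule maximizes the class likelihood of the subimage times, for each of the four neighbors, the sum over neighbor classes of the neighbor likelihood weighted by the conditional class probability. The array layout, the function name and the numbers are hypothetical.

```python
import numpy as np

def compound_decision(lik_k, lik_neighbors, transition):
    """
    lik_k:         shape (m,), likelihood of the center measurement under each class
    lik_neighbors: shape (4, m), likelihood of each neighbor measurement under each class
    transition:    shape (m, m), transition[a, b] = P(neighbor class b / center class a)
    Returns the maximizing class index and the score of every class.
    """
    # context[j, a] = sum over b of lik_neighbors[j, b] * transition[a, b]
    context = lik_neighbors @ transition.T
    scores = lik_k * context.prod(axis=0)
    return int(np.argmax(scores)), scores

lik_k = np.array([0.45, 0.55])                   # center alone slightly favors class 1
lik_neighbors = np.tile([0.9, 0.1], (4, 1))      # all four neighbors look like class 0
transition = np.array([[0.8, 0.2],
                       [0.2, 0.8]])              # neighbors tend to share the class

ml_label = int(np.argmax(lik_k))
compound_label, scores = compound_decision(lik_k, lik_neighbors, transition)
print("maximum likelihood decision:", ml_label)
print("compound decision:", compound_label)
```

In this example the four neighbors outweigh the weak evidence of the center subimage, so the compound decision differs from the simple maximum likelihood decision.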

VI. Decision Rules Based on Discrete and Continuous Measurements

Most pattern recognition work assumes either discrete or continuous measurements
(including measurements quantized from continuous ones). In image
recognition, it is possible to tentatively assign each subimage to one of several
possible classes, which is a discrete quantity, while the actual measurement of
the subimage is continuous. In the decision tree framework, an overall classification
of the image may be made by using all the information on each subimage
including the preliminary decision made on it. A similar situation arises in
medical diagnosis in which the final diagnosis depends on decisions made on some
tests and other continuous measurements.

Recently, Krzanowski [12] considered the use of Fisher's linear discriminant
function for classification with a set of p continuous and q binary variables.
His work is immediately applicable to medical data. From the information provided
by the discrete variable, a likelihood ratio is formed on the continuous variable
and compared with a threshold determined by the discrete variable. For image
analysis, a decision or interpretation has to be made on an image containing a
number of subimages on which individual decisions may be made first. A bottom-up
decision tree may be established to reach the best final decision. Inconsistent
decisions between two neighbor subimages may indicate the existence of an object
boundary or an incorrect decision on one of them. Backtracking or an error correction
mechanism may be added to the decision making process. The size of the subimage
should be chosen so that it is much smaller than the object size. The decision
process can be summarized as follows:

Step 1. Starting from the first subimage, compare its decision with all eight
        neighbors. Proceed next with one of the subimages of the same decision. If
        no consistent decision is available, proceed with any neighbor.

Step 2. Examine the second subimage in the same manner as Step 1. Repeat the
        step as many times as needed until returning to the first subimage with
        a closed boundary. The desired object is located.

Step 3. If a closed boundary is not available after search and merge in Steps 1
        and 2, then the decision is made that the image does not contain the object.

Obviously other decision tree procedures can be established (see e.g. [13])
for the same objective. These procedures are much easier to implement than the
use of a "one shot" compound decision function.
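
One possible reading of the search and merge idea is sketched below. It is a simplified region-growing variant and not the exact procedure of Steps 1 to 3: starting from a seed subimage, all subimages reachable through eight-neighbor links with the same preliminary decision are merged, and the region is treated as having a closed boundary when none of its members lies on the image border. That closure test, the function names and the toy decision grid are assumptions made for illustration.

```python
from collections import deque

# Offsets of the eight neighbors of a subimage on the grid.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def merge_consistent(decisions, start):
    """Collect subimages reachable from `start` through 8-neighbors with the same decision."""
    rows, cols = len(decisions), len(decisions[0])
    target = decisions[start[0]][start[1]]
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in NEIGHBORS:
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in seen and decisions[nr][nc] == target:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen

def has_closed_boundary(region, shape):
    """Assumed closure test: no member of the region touches the image border."""
    rows, cols = shape
    return all(0 < r < rows - 1 and 0 < c < cols - 1 for r, c in region)

# Hypothetical preliminary decisions: 1 = object, 0 = background.
decisions = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
shape = (len(decisions), len(decisions[0]))
inner = merge_consistent(decisions, (1, 1))
corner = merge_consistent(decisions, (4, 4))
print(sorted(inner), has_closed_boundary(inner, shape))
print(sorted(corner), has_closed_boundary(corner, shape))
```

The isolated object decision in the corner is not merged with the central region and does not yield a closed boundary, which is the kind of inconsistency a backtracking or error correction step would then examine.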

VII. Finite Sample Stochastic Syntax Analysis

The production probabilities in stochastic syntax analysis [14] are usually
estimated from a set of distinct sample strings by frequency ratio. The limited
string sample size is a source of error in estimation and in the final classification
performance. The error accumulates as a sequence of production rules is applied.
The nonmonotonic relation between confidence for $\hat{p}_{ij}$ and sample size (page 181 of
[14]) is rather unexpected. To simplify the analysis, assume that M distinct
production rules have to be applied to complete a parse. Let $e_{ij}$ be the
difference between the estimated and true production probabilities, i.e.

$$e_{ij} = \hat{p}_{ij} - p_{ij} \qquad (26)$$

Then $E(e_{ij}) = 0$ and $\mathrm{cov}(e_{ij}, e_{ik}) = -p_{ij}p_{ik}/n$, $j \neq k$, where $n = \sum_j n_{ij}$ with $n_{ij}$
defined by Eq. (6.4) of [14]. $\hat{p}_{ij}$ approaches $p_{ij}$ as the number of sample strings
t, i.e. the sample size, approaches infinity. The value n is proportional to t.


J 


which  has  the  expected  value 


M - M _ n M 

E(  n p. .)  ■ n p..  - ^ IT  p. . + higher  order  terms  (27) 

j-1  XJ  j-1  n j-1 


If we ignore the higher order terms then the likelihood function based on estimated
production probabilities is expected to be off from the true value by an amount
inversely proportional to the sample size and proportional to $M^2$. For long strings
the accuracy of the likelihood function may thus be very poor. The variance of
the likelihood function can also be determined. It appears that the only way to
reduce the finite sample effect is to increase the sample size.
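
The size of this effect can be checked by simulation. The sketch below is an illustration under assumptions that go beyond the text: the M production rules all rewrite the same nonterminal, the n observed applications of that nonterminal follow a multinomial distribution, each probability is estimated by frequency ratio, and the likelihood is the product of the M estimates. The probabilities, sample sizes and names are hypothetical.

```python
import numpy as np

def mean_estimated_product(p, n, trials=200000, seed=0):
    """Average of the product of frequency-ratio estimates over many sample sets."""
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(n, p, size=trials)   # shape (trials, M)
    return (counts / n).prod(axis=1).mean()

p = np.array([0.4, 0.3, 0.2, 0.1])    # true production probabilities, M = 4
true_product = p.prod()
m = p.size

ratio_20 = mean_estimated_product(p, 20) / true_product
ratio_200 = mean_estimated_product(p, 200) / true_product

# Under the multinomial assumption the exact ratio is the product of (1 - l/n)
# for l = 0 .. M-1, roughly 1 - M(M-1)/(2n) for large n.
exact_20 = np.prod(1.0 - np.arange(m) / 20.0)
print("n = 20:  simulated", ratio_20, "exact", exact_20)
print("n = 200: simulated", ratio_200)
```

Under these assumptions the estimated likelihood falls short of the true value on average, and the shortfall shrinks roughly in inverse proportion to n.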

VIII. Concluding Remarks

This paper has examined some current problems in statistical pattern
recognition, especially the effects of finite sample size, which cause the gap
between theory and practice in pattern recognition. When the effects are
monotonic then the best way to reduce such effects is probably by increasing the
sample size. There are many other problems, as listed in Section I, in statistical
pattern recognition which also remain to be studied. Thus we believe the area
should remain an active one for researchers.